
Voxtral Realtime: enable CUDA backend with int4 quantization#17798

Merged
mergennachin merged 1 commit into main from enable_voxtral_realtime on Mar 4, 2026
Conversation

@mergennachin (Contributor)

Add CUDA/AOTI backend support for the Voxtral Realtime model alongside
the existing XNNPACK and Metal backends.

Model (model.py):

  • CudaSDPA: F.scaled_dot_product_attention with repeat_interleave for
    GQA expansion and boolean attention masks (Triton SDPA requirement)
  • StaticKVCache (shared with Metal) for [B,H,S,D] layout with index_copy_
  • StandardEncoderRingKVCache/StandardEncoderSDPA for streaming encoder
  • _build_causal_mask_bool: 4D boolean mask for Triton compatibility
  • Simplified LMAttention.forward to always pass attn_mask (None for XNNPACK)
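The CudaSDPA and `_build_causal_mask_bool` pieces can be sketched roughly as follows (an illustrative simplification of the description above, not the actual model code):

```python
import torch
import torch.nn.functional as F

def sdpa_gqa_bool_mask(q, k, v, mask):
    # Expand GQA KV heads with repeat_interleave so K/V match the query
    # head count, then call SDPA with a boolean attn_mask (True = attend),
    # which the Triton SDPA path is said to require.
    n_rep = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(n_rep, dim=1)  # [B, H_kv, S, D] -> [B, H_q, S, D]
    v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

def build_causal_mask_bool(seq_len):
    # 4D boolean causal mask, broadcastable over [B, H, S, S].
    return torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool)
    ).view(1, 1, seq_len, seq_len)

B, Hq, Hkv, S, D = 1, 8, 2, 4, 16
q = torch.randn(B, Hq, S, D)
k = torch.randn(B, Hkv, S, D)
v = torch.randn(B, Hkv, S, D)
out = sdpa_gqa_bool_mask(q, k, v, build_causal_mask_bool(S))
assert out.shape == (B, Hq, S, D)
```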

Export (export_voxtral_rt.py):

  • --backend cuda with CudaPartitioner and conv1d_to_conv2d decomposition
  • --dtype flag (default fp32, bf16 for CUDA Triton SDPA)
  • --qlinear-packing-format / --qlinear-encoder-packing-format for
    tile_packed_to_4d int4 quantization
  • CUDA device placement, Dim.AUTO for audio encoder, .ptd output

Runner (main.cpp, voxtral_realtime_runner.cpp/.h):

  • --data_path flag for .ptd delegate data (CUDA compiled kernels)
  • Module two-arg constructor for pte+ptd loading

Build (CMakePresets.json, Makefile):

  • voxtral-realtime-cuda preset
  • make voxtral_realtime-cuda target

CI (.github/workflows/cuda.yml, .ci/scripts/):

  • Voxtral Realtime in CUDA CI matrix (int4-tile-packed, offline mode)
  • Export/test scripts updated for CUDA quantization args and data path

Copilot AI review requested due to automatic review settings March 2, 2026 22:30

pytorch-bot bot commented Mar 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17798

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Cancelled Job

As of commit e5c3690 with merge base 0907294:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 2, 2026

github-actions bot commented Mar 2, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment:

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@mergennachin mergennachin temporarily deployed to upload-benchmark-results March 2, 2026 23:33 — with GitHub Actions Inactive
@mergennachin mergennachin force-pushed the enable_voxtral_realtime branch from 1e5399a to afe08f0 Compare March 3, 2026 15:18
@mergennachin mergennachin temporarily deployed to upload-benchmark-results March 3, 2026 16:11 — with GitHub Actions Inactive
@mergennachin mergennachin force-pushed the enable_voxtral_realtime branch from afe08f0 to 50e3a3d Compare March 3, 2026 16:48
Copilot AI review requested due to automatic review settings March 3, 2026 16:48
@mergennachin mergennachin force-pushed the enable_voxtral_realtime branch from 50e3a3d to e708015 Compare March 3, 2026 16:57
Copilot AI left a comment:

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

examples/models/voxtral_realtime/voxtral_realtime_runner.cpp:612

  • logits_to_token() recreates/reseeds a Sampler on every decode step (seeded from std::time(nullptr)), so temperature > 0 sampling won’t have a stable RNG stream across tokens and can become repetitive. Since StreamingSession already has a sampler_ member, it would be better to use that persistent sampler (with dtype switching for Float/BFloat16/Half) instead of calling logits_to_token() each step.
      prev_token_, static_cast<uint64_t>(next_token));
  if (piece.ok()) {
    token_cb_(*piece);
  }
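The reseeding concern raised above can be illustrated with a minimal Python sketch (hypothetical names; the actual runner is C++): seeding a fresh RNG from wall-clock time on every decode step collapses the random stream whenever several steps fall within the same second.

```python
import random

def sample_reseeded(logits_steps, now=1234567890):
    # Anti-pattern: recreate and reseed the sampler on every decode step.
    # Within the same second, every step replays the identical RNG stream.
    out = []
    for logits in logits_steps:
        rng = random.Random(now)  # ~ std::time(nullptr): same value each step
        out.append(rng.choices(range(len(logits)), weights=logits, k=1)[0])
    return out

def sample_persistent(logits_steps, seed=1234567890):
    # Fix: one persistent sampler seeded once; the stream advances per token.
    rng = random.Random(seed)
    return [rng.choices(range(len(l)), weights=l, k=1)[0] for l in logits_steps]

uniform = [[1.0, 1.0, 1.0, 1.0]] * 8
assert len(set(sample_reseeded(uniform))) == 1  # degenerates to one token
```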


Comment on lines 286 to +292
  python -m executorch.examples.models.voxtral_realtime.export_voxtral_rt \
    --model-path "$LOCAL_MODEL_DIR" \
    --backend "$DEVICE" \
    ${STREAMING_ARG} \
    --output-dir "${OUTPUT_DIR}" \
-   ${VR_QUANT_ARGS}
+   ${VR_QUANT_ARGS} \
+   ${VR_DTYPE_ARGS}
Copilot AI commented on Mar 3, 2026:

In the voxtral_realtime export path, the script doesn’t validate that the CUDA delegate data file (aoti_cuda_blob.ptd) was produced. Since the runner requires --data_path for CUDA, it’d be safer to add a test -f "${OUTPUT_DIR}/aoti_cuda_blob.ptd" check when DEVICE=cuda (similar to the Parakeet branch) so export failures are caught immediately.

@mergennachin mergennachin force-pushed the enable_voxtral_realtime branch from e708015 to 05f0ed2 Compare March 3, 2026 17:17
@mergennachin mergennachin temporarily deployed to upload-benchmark-results March 3, 2026 18:28 — with GitHub Actions Inactive
| Backend | Offline | Streaming | Quantization |
|---------|---------|-----------|--------------|
| `xnnpack` | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w` |
| `metal` | ✓ | ✓ | none (fp32) or `fpa4w` (Metal-specific 4-bit) |
| `cuda` | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w` |
Contributor:

Does Cuda support 8da4w/8da8w?

Related, I'm pretty sure xnnpack does not support 4w/8w.

Contributor (Author):

@metascroy

Does Cuda support 8da4w/8da8w?

Good catch, will fix.

Related, I'm pretty sure xnnpack does not support 4w/8w.

xnnpack supports per-channel 4w and 8w. For example, we use 8w for token embeddings.

@metascroy (Contributor) commented on Mar 3, 2026:

ET's embedding CPU op supports weight only schemes, but I don't think xnnpack supports weight-only quantization for linear layers.

With that said, 4w/8da4w and 8w/8da8w quantize weight data the same. The only difference is the 8da variants add fake activation quantization in front.
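The point that the weight-only and dynamic-activation variants share identical weight quantization can be sketched numerically (an illustrative NumPy toy, not ExecuTorch's actual quantizer):

```python
import numpy as np

def quantize_weight_int8(w):
    # Per-channel symmetric int8 weight quantization (identical for 8w and 8da8w).
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_8w(x, q, scale):
    # Weight-only: activations stay in float.
    return x @ (q.astype(np.float32) * scale).T

def linear_8da8w(x, q, scale):
    # 8da variant: fake-quantize activations in front of the same matmul.
    a_scale = np.abs(x).max() / 127.0
    xq = np.clip(np.round(x / a_scale), -127, 127) * a_scale
    return xq @ (q.astype(np.float32) * scale).T

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
x = rng.standard_normal((2, 8)).astype(np.float32)
q, s = quantize_weight_int8(w)
# Both variants use the identical quantized weights; only the activation
# handling differs, so outputs are close but not bit-identical.
assert np.allclose(linear_8w(x, q, s), linear_8da8w(x, q, s), atol=0.2)
```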

Contributor:

@manuelcandales is there any plan for metal aoti to use int4/int8 for a more uniform experience.

The kernel should support it because I'm using int4/int8 with MLX.

--model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
--backend cuda \
--dtype bf16 \
--streaming \
Contributor:

if this is supported, then why not test it in CI?

fi
source .ci/scripts/export_model_artifact.sh cuda "${{ matrix.model.repo }}/${{ matrix.model.name }}" "${{ matrix.quant }}" "${RUNNER_ARTIFACT_DIR}"
# Voxtral Realtime uses offline mode for CUDA CI (not streaming)
Contributor:

why not streaming?

Copilot AI review requested due to automatic review settings March 4, 2026 04:43
@mergennachin mergennachin force-pushed the enable_voxtral_realtime branch from 05f0ed2 to e5c3690 Compare March 4, 2026 04:43
Copilot AI left a comment:

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.



Comment on lines 539 to +570
@@ -474,15 +552,22 @@ def main():

    os.makedirs(args.output_dir, exist_ok=True)

    # Load model
    model_dtype = {"fp32": torch.float32, "bf16": torch.bfloat16}[args.dtype]

    print("Loading model...")
    model = load_model(
        args.model_path,
        max_seq_len=args.max_seq_len,
        n_delay_tokens=args.delay_tokens,
        dtype=model_dtype,
        backend=args.backend,
    )

    # Move to CUDA for CUDA backend export (AOTInductor needs CUDA tensors)
    if args.backend == "cuda":
        print("Moving model to CUDA...")
        model.cuda()

Copilot AI commented on Mar 4, 2026:

For --backend cuda, leaving --dtype at the current default (fp32) is likely to produce an exported model that fails at runtime/compile time once SDPA is replaced by the CUDA Triton triton::sdpa op, which currently enforces bfloat16 inputs. Consider either (a) making bf16 the default when --backend cuda, (b) erroring out if --backend cuda and --dtype fp32, or (c) automatically setting a CUDA compile spec (e.g., triton_kernel_mode=OFF) when exporting fp32 so SDPA falls back to a non-Triton implementation.
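Option (b) above is straightforward to express at argument-parsing time. A minimal sketch (hypothetical argument names mirroring the flags described in this PR; the constraint that Triton SDPA enforces bfloat16 is taken from the comment, not verified here):

```python
import argparse

def parse_args(argv=None):
    # Reject fp32 when exporting for CUDA, since the triton::sdpa op is
    # assumed to require bfloat16 inputs.
    p = argparse.ArgumentParser()
    p.add_argument("--backend", choices=["xnnpack", "metal", "cuda"],
                   default="xnnpack")
    p.add_argument("--dtype", choices=["fp32", "bf16"], default="fp32")
    args = p.parse_args(argv)
    if args.backend == "cuda" and args.dtype == "fp32":
        p.error("--backend cuda requires --dtype bf16 "
                "(Triton SDPA enforces bfloat16 inputs)")
    return args

assert parse_args(["--backend", "cuda", "--dtype", "bf16"]).dtype == "bf16"
```

Alternative (a), defaulting `--dtype` to `bf16` when `--backend cuda` is given, can be done the same way by rewriting `args.dtype` instead of calling `p.error`.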

Comment on lines +301 to +304
# Add CUDA data path if present
if [ "$DEVICE" = "cuda" ] && [ -f "${MODEL_DIR}/aoti_cuda_blob.ptd" ]; then
RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
fi
Copilot AI commented on Mar 4, 2026:

This block appends --data_path ... for CUDA, but the script already adds --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd for all non-llama runners earlier (before the model-specific case). For Voxtral Realtime on CUDA this results in duplicate --data_path arguments. Please remove this per-model addition (or refactor the earlier common CUDA handling to avoid double-appending for voxtral_realtime).

Suggested change
# Add CUDA data path if present
if [ "$DEVICE" = "cuda" ] && [ -f "${MODEL_DIR}/aoti_cuda_blob.ptd" ]; then
RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
fi

@mergennachin mergennachin temporarily deployed to upload-benchmark-results March 4, 2026 05:42 — with GitHub Actions Inactive
@mergennachin mergennachin merged commit 5193141 into main Mar 4, 2026
374 of 379 checks passed
@mergennachin mergennachin deleted the enable_voxtral_realtime branch March 4, 2026 12:36